{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This script accompanies the manuscript \"Leveraging Researcher Domain Expertise to Annotate Concepts within Imbalanced Data\".\n",
    "\n",
    "In that paper, we describe a method for combining researcher domain expertise with an unsupervised exploration of the latent semantic space, to annotate theory-driven categories for a supervised classifier.\n",
    "\n",
    "Here, we lay out the procedure for the simulations we ran to compare empirically our proposed method and two leading annotation approaches - random sampling and active learning.\n",
    "\n",
    "We ran these simulations on two datasets:\n",
    "\n",
    "1) 20 NewsGroups\n",
    "\n",
    "2) New York Times Front Page Dataset (Boydstun, 2013) with Comparative Agendas Project (CAP) categories (Dowding et al., 2015)\n",
    "\n",
    "In this script, we run on the New York Times Front Page dataset, but the procedure is the same for any fully-classified corpus. We have also prepared and uploaded document-level embeddings for the 20 NewsGroups dataset for replication of two of the simulations cited within our paper, yet such simulations run on slightly different category distribution schemes (and are explained in a script accompanying this one).\n",
    "\n",
    "May 2022"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Create embeddings for the full corpus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We start with our dataset, wanting to convert each document into embeddings on the shared latent semantic space. While a number of document-embedding techniques exist, here we use SentenceTransformers (https://huggingface.co/sentence-transformers/stsb-mpnet-base-v2) - pre-trained language models optimized for semantic similarity at the sentence level (Reimers and Gurevych, 2019). However, any other sentence or document embedding model could be used for this preparation stage.\n",
    "\n",
    "The SentenceTransformers are trained for the task of converting sentences (text segments) into embeddings. The New York Times Front page dataset includes full documents (having extracted the full articles to the existing dataset via LexisNexis). Thus, in that case we split the full articles into sentence segments, before converting each such segment into an embedding. Then, we found the mean vector for the document-level embedding.\n",
    "\n",
    "We have uploaded the document-embeddings for both datasets, but we note that there exist many different embedding models that users can try to utilize for improved results (or, models more relevant for their specific research cases). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import os\n",
    "import numpy as np\n",
    "import pickle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### DIRECTORY WITH THE DATASET:\n",
    "## change to the directory location on user computer\n",
    "\n",
    "DIR = r\"E:\\Simulation_Replication Materials\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Load the embeddings for the corpus\n",
    "embds = os.path.join(DIR, \"NYT_FrontPage_forSimulation_Embeddings.pickle\")\n",
    "\n",
    "pickle_in = open(embds,\"rb\")\n",
    "embeddings = pickle.load(pickle_in)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Load the CSV with the metadata for each embedding - text, category, etc.\n",
    "\n",
    "df = pd.read_csv(os.path.join(DIR, 'NYT_FrontPage_forSimulation_Metadata.csv'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Running the Simulations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At this stage we have our dataset fully converted to document embeddings, as well as our original dataframe containing the metadata necessary for this simulation - each document's true labels.\n",
    "\n",
    "For the simulation itself, we will choose randomly a single domain (i.e., a theoretical area of interest) to focus on in each run. We will then aggregate the rest of the domains to a single category 'Other'. Then, in the simulation itself, the classifiers will be tasked with classifying the multiple sub-categories within the domain."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### Add the document embeddings to the dataframe\n",
    "\n",
    "df['X'] = [np.array(x) for x in embeddings]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### Create lists of major categories, sub-categories:\n",
    "\n",
    "CATEGORIES = list(set(df['Category']))\n",
    "DOMAINS = list(set(df['Domain']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Keep only those Domains with at least 1000 articles to allow the simulation to run fully\n",
    "TOP_DOMAINS = [\"International Affairs and Foreign Aid\", \"Defense\", \n",
    "               \"Government Operations\", \"Law, Crime, and Family Issues\", \n",
    "               \"Health\", \"Banking, Finance, and Domestic Commerce\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Create the randomizer functions to randomly choose the major topic domains in each simulation round:\n",
    "\n",
    "import random\n",
    "\n",
    "## Choose the random seeding to allow for replication of the procedure\n",
    "np.random.seed(17)\n",
    "\n",
    "def choose_random_domain(domains=DOMAINS):\n",
    "    ## Return a single randomly chosen major topic domain\n",
    "    return(random.sample(domains, 1))\n",
    "\n",
    "def filter_small_categories(df, domain_to_keep):\n",
    "    ## In some cases, some subtopics were too small even for the preliminary sampling stage\n",
    "    ## Thus, we include a filter to remove these categories from our simulation:\n",
    "\n",
    "    temp = df[df['Domain'] == domain_to_keep]\n",
    "    \n",
    "    ## Create list of the sub-topics:\n",
    "    categories_ = list(set(temp['Category']))\n",
    "    \n",
    "    ## Create a new, empty dataframe to begin to fill with the categories we will target\n",
    "    new_df = pd.DataFrame()\n",
    "    \n",
    "    ## Run through the list of sub-topics to make sure they contain at least 100 documents\n",
    "    for cat in categories_:\n",
    "        subset = temp[temp['Category'] == cat]\n",
    "        if len(subset) >= 100:\n",
    "            new_df = pd.concat([new_df, subset])\n",
    "    \n",
    "    ## Now, join the filtered domain of interest back to the rest of the dataframe\n",
    "    other2 = df[df['Domain'] != domain_to_keep]\n",
    "    new_df = pd.concat([new_df, other2])\n",
    "            \n",
    "    return new_df\n",
    "\n",
    "def change_labels(domain_to_keep, df_):\n",
    "    ## Keep the domain of interest, while aggregating the other majortopics to a single 'Other' category\n",
    "    \n",
    "    new_df = df_.copy()\n",
    "    \n",
    "    ## Change all the other domains to other\n",
    "    other_df = new_df[new_df['Domain'] != domain_to_keep]\n",
    "    other_df['Category'] = 'Other'\n",
    "    \n",
    "    ## Keep only the categories from the target domain\n",
    "    subset = new_df[new_df['Domain'] == domain_to_keep]\n",
    "    \n",
    "    ## Join together for the new dataframe ready for the simulation\n",
    "    changed_df = pd.concat([other_df, subset])\n",
    "\n",
    "    return changed_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we choose the parameters for the simulation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## How many times to run the simulation:\n",
    "RUNS = 300\n",
    "\n",
    "## How many samples to extract at each sampling iteration:\n",
    "STEPSIZE = 18\n",
    "\n",
    "## Number of iterations for each simulation run \n",
    "## (The more iterations, the more sampling rounds executed)\n",
    "NUM_STEPS = 100"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Imports for simulation functions:\n",
    "\n",
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.metrics import classification_report\n",
    "from sklearn.model_selection import train_test_split\n",
    "from scipy import stats\n",
    "from scipy.stats import entropy\n",
    "from sklearn.svm import LinearSVC\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "## TQDM Notebook - not required, but allows the user to track simulation progress:\n",
    "from tqdm import tqdm_notebook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Additional functions needed for the simulation:\n",
    "\n",
    "## Trains an SVM classification model on X and y and returns the classifier\n",
    "def train_svm(X, y, proba=False): # add hyperparameter tuning\n",
    "    clf = SVC(probability=proba)\n",
    "    ## Fit the model on the training set:\n",
    "    clf.fit(X, y)\n",
    "    return clf\n",
    "\n",
    "## Takes as an input a trained SVM model, runs on the test set, and returns the accuracy results\n",
    "def create_classification_report(clf, testX, testY, output_dict=True):\n",
    "    ## Use the model to make predictions on the test set:\n",
    "    predY = clf.predict(testX)\n",
    "    results = classification_report(le.inverse_transform([int(x) for x in testY]), le.inverse_transform([int(x) for x in predY]),\n",
    "                                    output_dict=output_dict)\n",
    "    return results\n",
    "\n",
    "## For our method, after the initial sampling and each ensuing round of stratified sampling, \n",
    "## need to calculate the centroids for each concept of interest\n",
    "\n",
    "## At each stage, run on the dataframe of the expanding training set\n",
    "def calculate_centroids(df, categories):\n",
    "    centroids_ = []\n",
    "    \n",
    "    for c in categories:\n",
    "        ## Run over each subcategory:\n",
    "        temp = df[df['y'] == c]\n",
    "        \n",
    "        vecs = []\n",
    "        ## Find the mean for the document embbedings - the centroid\n",
    "        for i,r in temp.iterrows():\n",
    "            vecs.append(r.X)\n",
    "        meanvec = np.mean(vecs, axis=0)\n",
    "        centroids_.append(meanvec)\n",
    "    \n",
    "    ## Return the centroids\n",
    "    return centroids_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we run the actual simulation. Each run begins with the compilation of an initial training set, based on each strategy we want to compare: Completely random sampling for \"random sampling\" and \"active learning\", and a sampling of each target category for our method - corresponding to the role of human experts in the creation of the core training set.\n",
    "\n",
    "In the next runs, we continue to expand the three training sets in parallel. For random sampling we simply add additional samples, for active learning we add those instances the model is most uncertain on, and for our method we utilize stratified sampling.\n",
    "\n",
    "Between each iteration, we train a SVM model on the updated training sets for each parallel method, and record the accuracy scores for comparison later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### Function to actually execute the simulation for each run\n",
    "\n",
    "def run_annotations(pooldf, filtered_categories, test, STEPSIZE=STEPSIZE, NUM_STEPS=NUM_STEPS):\n",
    "    ## First, define a number of lists and empty dataframes which we will continue to fill through each iteration\n",
    "    ## These will be returned in the output and will later be used to record the efficiency and accuracy\n",
    "    ## of each method:    \n",
    "    \n",
    "    ## List to record the number of samples in the training set at each iteration\n",
    "    samples = [] # number of samples\n",
    "\n",
    "    ## Lists to hold the f1 accuracy scores for each method over each iteration\n",
    "    f1s_random = [] # random sampling\n",
    "    f1s_active = [] # active learning\n",
    "    ## Note - we will run two versions of our stratified sampling method in this simulation to offer\n",
    "    ## a more comprehensive exploration of our method. The only thing we change here are the sizes and distances\n",
    "    ## of the hierarchal strata. This is due to the varying densities and distances of the latent semantic spaces for\n",
    "    ## different corpora/datasets. However, the guiding principle remains the same.\n",
    "    f1s_ours_original = [] # Equally-spaced strata covering the full 0-1 cosine distance from each centroid\n",
    "    f1s_ours_closer = [] # Smaller strata focusing on the areas closer to the centroids, not covering the full distance\n",
    "\n",
    "    ## The pool of unlabelled documents left for each method\n",
    "    ## At the start, the whole corpus is still unlabelled:\n",
    "    pooldf_random = pooldf.copy()\n",
    "    pooldf_active = pooldf.copy()\n",
    "    pooldf_ours_original = pooldf.copy()\n",
    "    pooldf_ours_closer = pooldf.copy()\n",
    "\n",
    "    ## The collection of annotated samples - i.e., the expanding training set\n",
    "    ## We start with the initial sampling based on each approach:\n",
    "    ## For random sampling and guided search, take a random sample from the full dataset:\n",
    "    annotated_random = pooldf_random.sample(20)\n",
    "    ## However, due to the imbalance in the data, need to ensure at least two categories present \n",
    "    ## in the training set for the SVM model to work:\n",
    "    if len(set(annotated_random['y'])) == 1:\n",
    "        annotated_random.drop(annotated_random.tail(1).index,inplace=True)\n",
    "        annotated_random = annotated_random.append(pooldf_ours_original.groupby(\"y\").sample(1))\n",
    "    annotated_active = annotated_random.copy()\n",
    "    ## For our approach, emulating the human element we sample from each category:\n",
    "    annotated_ours_original = pooldf_ours_original.groupby(\"y\").sample(20) #  - start with samples from all\n",
    "    annotated_ours_closer = annotated_ours_original.copy()\n",
    "\n",
    "    ## Remove the annotated texts from the unlabelled pool\n",
    "    pooldf_random = pooldf_random.drop(annotated_random.index)\n",
    "    pooldf_active = pooldf_active.drop(annotated_active.index)\n",
    "    pooldf_ours_closer = pooldf_ours_closer.drop(annotated_ours_closer.index)\n",
    "    pooldf_ours_original = pooldf_ours_original.drop(annotated_ours_original.index)\n",
    "    \n",
    "    ###############################################################################################\n",
    "    ################################ BEGIN SAMPLING ITERATIONS ####################################\n",
    "    ###############################################################################################\n",
    "    \n",
    "    ## Main loop - based on the number of iterations defined earlier:\n",
    "    for i in range(NUM_STEPS):\n",
    "        \n",
    "        ### add in the first step as taking the preliminary sample collection\n",
    "        \n",
    "        if i == 0:\n",
    "            annotated_ours_original = annotated_ours_original\n",
    "            annotated_ours_closer = annotated_ours_closer\n",
    "            annotated_random = annotated_random\n",
    "            annotated_active = annotated_active\n",
    "        \n",
    "        elif i > 0:\n",
    "        \n",
    "            ## Now, run through each method in parallel:\n",
    "\n",
    "            ############################ OUR METHOD - ORIGINAL #################################\n",
    "            ### Our method - sample from each strata for a category chosen at random\n",
    "\n",
    "            c = calculate_centroids(annotated_ours_original, filtered_categories) # Find the centroids for each category\n",
    "            similarities = cosine_similarity(np.array(pooldf_ours_original[\"X\"].to_list()), c) # Find distances of all documents from each centroid\n",
    "\n",
    "            for ii, cc in enumerate(filtered_categories): # Add column with distance from each centroid for each document in pool\n",
    "                pooldf_ours_original[cc] = [x[ii] for x in similarities]\n",
    "\n",
    "            ## The hierarchal strata:\n",
    "            strata_o = [(0.8, 0.9), (0.7, 0.8), (0.6, 0.7), (0.5, 0.6), (0.4, 0.5), \n",
    "                      (0.3 ,0.4)] \n",
    "\n",
    "            ## Randomly choose category\n",
    "            cat_choice = random.choice(filtered_categories)\n",
    "            ## Calculate empirical strata for category over full distance\n",
    "            emp_strata = [np.quantile(pooldf_ours_original[str(cat_choice)], [x, y]) for x, y in strata_o]\n",
    "\n",
    "            new_ours_orig = []\n",
    "\n",
    "            ## The stratified sampling:\n",
    "            for strata in strata_o:\n",
    "                ## Select only subset corresponding to specific strata\n",
    "                temp = pooldf_ours_original[(pooldf_ours_original[str(cat_choice)] > strata[0]) & (pooldf_ours_original[str(cat_choice)] <= strata[1])]\n",
    "\n",
    "                if len(temp) < (STEPSIZE / (len(strata_o))): # Ensure no errors due to sampling larger than the strata\n",
    "                    newsamp = temp\n",
    "                    new_ours_orig.append(newsamp)\n",
    "                else:\n",
    "                    newsamp = temp.sample(int(STEPSIZE / (len(strata_o))))\n",
    "                    new_ours_orig.append(newsamp)\n",
    "\n",
    "                ## Remove samples from remaining pool\n",
    "                pooldf_ours_original = pooldf_ours_original.drop(newsamp.index)\n",
    "\n",
    "            ## The stratified samples:\n",
    "            new_ours_original = pd.concat(new_ours_orig)\n",
    "\n",
    "            ############################ OUR METHOD - MODIFIED STRATA #################################\n",
    "            ### Our method, but strata closer to the centroids\n",
    "\n",
    "            c = calculate_centroids(annotated_ours_closer, filtered_categories)\n",
    "            similarities = cosine_similarity(np.array(pooldf_ours_closer[\"X\"].to_list()), c)\n",
    "\n",
    "            for ii, cc in enumerate(filtered_categories):\n",
    "                pooldf_ours_closer[cc] = [x[ii] for x in similarities]\n",
    "\n",
    "            ## Closer strata for larger, more spread out corpus\n",
    "            strata_o = [(0.95, 1), (0.9, 0.95), (0.85, 0.9), (0.8, 0.85), (0.7, 0.8), \n",
    "                      (0.6 ,0.7)]   \n",
    "\n",
    "            cat_choice = random.choice(filtered_categories)\n",
    "            emp_strata = [np.quantile(pooldf_ours_closer[str(cat_choice)], [x, y]) for x, y in strata_o]\n",
    "\n",
    "            new_ours_closer = []\n",
    "\n",
    "            for strata in strata_o:\n",
    "                temp = pooldf_ours_closer[(pooldf_ours_closer[str(cat_choice)] > strata[0]) & (pooldf_ours_closer[str(cat_choice)] <= strata[1])]\n",
    "\n",
    "                if len(temp) < (STEPSIZE / (len(strata_o))):\n",
    "                    newsamp = temp\n",
    "                    new_ours_closer.append(newsamp)\n",
    "                else:\n",
    "                    newsamp = temp.sample(int(STEPSIZE / (len(strata_o))))\n",
    "                    new_ours_closer.append(newsamp)\n",
    "\n",
    "                pooldf_ours_closer = pooldf_ours_closer.drop(newsamp.index)\n",
    "\n",
    "            new_ours_closer = pd.concat(new_ours_closer)\n",
    "\n",
    "\n",
    "            ############################ RANDOM SAMPLING #############################################\n",
    "            new_random = pooldf_random.sample(STEPSIZE) \n",
    "            pooldf_random = pooldf_random.drop(new_random.index)\n",
    "\n",
    "\n",
    "            ############################ ACTIVE LEARNING #############################################\n",
    "            ## Train svm on the training set\n",
    "            clf = train_svm(annotated_active[\"X\"].to_list(), annotated_active[\"y\"].to_list(), proba=True)\n",
    "            ## Add column with entropy of each prediction\n",
    "            pooldf_active[\"metric\"] = entropy(clf.predict_proba(pooldf_active[\"X\"].to_list()).transpose())\n",
    "            ## Sort by descending uncertainty\n",
    "            pooldf_active = pooldf_active.sort_values(by=\"metric\", ascending=False)\n",
    "            ## Sample:\n",
    "            new_active = pooldf_active.iloc[0:STEPSIZE]\n",
    "            pooldf_active = pooldf_active.drop(new_active.index)\n",
    "\n",
    "\n",
    "        ###############################################################################################\n",
    "        ###################################### END SAMPLING STAGE #####################################\n",
    "        ###############################################################################################\n",
    "\n",
    "            ## Add newly sampled texts to each of the training sets\n",
    "            annotated_random = pd.concat([annotated_random, new_random])\n",
    "            annotated_active = pd.concat([annotated_active, new_active])\n",
    "            annotated_ours_original = pd.concat([annotated_ours_original, new_ours_original])\n",
    "            annotated_ours_closer = pd.concat([annotated_ours_closer, new_ours_closer])\n",
    "\n",
    "        ## Train intermediate models\n",
    "        model_random = train_svm(annotated_random[\"X\"].to_list(), annotated_random[\"y\"].to_list())\n",
    "        model_active = train_svm(annotated_active[\"X\"].to_list(), annotated_active[\"y\"].to_list())\n",
    "        model_ours_original = train_svm(annotated_ours_original[\"X\"].to_list(), annotated_ours_original[\"y\"].to_list())\n",
    "        model_ours_closer = train_svm(annotated_ours_closer[\"X\"].to_list(), annotated_ours_closer[\"y\"].to_list())\n",
    " \n",
    "        ## Evaulate and save results\n",
    "        f1s_random.append(create_classification_report(model_random, test[\"X\"].to_list(), test[\"y\"]))\n",
    "        f1s_active.append(create_classification_report(model_active, test[\"X\"].to_list(), test[\"y\"]))\n",
    "        f1s_ours_closer.append(create_classification_report(model_ours_closer, test[\"X\"].to_list(), test[\"y\"]))\n",
    "        f1s_ours_original.append(create_classification_report(model_ours_original, test[\"X\"].to_list(), test[\"y\"]))\n",
    "        \n",
    "        ## Save number of samples\n",
    "        samples.append(annotated_random.shape[0])\n",
    "        \n",
    "        ## Return accuracy results for each method, number of samples\n",
    "        to_return = [f1s_random, f1s_active, f1s_ours_original, f1s_ours_closer, samples]\n",
    "\n",
    "    return to_return"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Convert categories to integer labels\n",
    "from sklearn import preprocessing\n",
    "le = preprocessing.LabelEncoder()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Function to run the full simulation from start to finish (one run)\n",
    "def run_simulation(pooldf, RUNS=RUNS, STEPSIZE=STEPSIZE, NUM_STEPS=NUM_STEPS):\n",
    "    \n",
    "    ## First, compile/prepare the dataframe for each run\n",
    "    \n",
    "    ## Choose a domain at random:\n",
    "    dom_ = choose_random_domain(TOP_DOMAINS)[0]\n",
    "    \n",
    "    ## Then, prepare the dataframe around the chosen major topic/domain\n",
    "    converted_ = change_labels(dom_, pooldf)\n",
    "    converted_ = filter_small_categories(converted_, dom_)\n",
    "\n",
    "    ## Assign numerical values for the categorical variable and store in another column\n",
    "    converted_['y'] = le.fit_transform(converted_['Category'])\n",
    "    \n",
    "    ## Split the dataset into test set and pool set\n",
    "    X_pool, X_test, y_pool, y_test = train_test_split(converted_['X'], converted_['y'], test_size=0.20, random_state=42)\n",
    "    y_pool = [str(x) for x in y_pool]\n",
    "    y_test = [str(x) for x in y_test]\n",
    "    pooldf = pd.DataFrame({\"X\": X_pool.tolist(), \"y\": y_pool})\n",
    "    test = pd.DataFrame({\"X\": X_test.tolist(), \"y\": y_test})\n",
    "    \n",
    "    new_categories = list(set(pooldf['y']))\n",
    "    \n",
    "    ## Run the simulation:\n",
    "    outputs = run_annotations(pooldf, new_categories, test=test)\n",
    "    \n",
    "    return outputs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "### Create dictionaries to hold all information for each run\n",
    "## Each key will be a run and the values will correspond to the f1 scores and samples\n",
    "## for each parallel method of the simulation\n",
    "\n",
    "random_dict = {}\n",
    "active_dict = {}\n",
    "ours_dict_closer = {}\n",
    "ours_original_dict = {}\n",
    "samples_dict = {}\n",
    "\n",
    "\n",
    "for run in tqdm_notebook(range(RUNS)):\n",
    "    outputs = run_simulation(df) # Return the outputs from the simulation's run\n",
    "    random_dict.setdefault(str(run),outputs[0])\n",
    "    active_dict.setdefault(str(run),outputs[1])\n",
    "    ours_original_dict.setdefault(str(run),outputs[2])\n",
    "    ours_dict_closer.setdefault(str(run),outputs[3])\n",
    "    samples_dict.setdefault(str(run),outputs[4])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualizing the Results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the advantages of utilizing a computerized simulation is the ability to run multiple experiments/iterations of our empirical comparisons. Now, we can ascertain the average performance for each method being tested. This helps to ensure that the results we find are due to the approach chosen, and not to idiosyncrasies of category choice or other inherent random elements. \n",
    "\n",
    "Here we find the average accuracy levels (vector) for each method and compare via a graph.\n",
    "\n",
    "The outputs are returned in dictionary form, and can be saved to file as pickles for revisiting later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Find the mean vector for each dictionary (each dictionary corresponds to an annotation strategy)\n",
    "def extract_macros_mean(results_dict):\n",
    "    vecs = []\n",
    "    for row in results_dict:\n",
    "        vec = []\n",
    "        for i in (results_dict[row]):\n",
    "            vec.append(i['macro avg']['f1-score'])\n",
    "        vecs.append(vec)\n",
    "    mean_ = np.mean([v for v in vecs], axis=0)\n",
    "    return mean_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Visualize the performances\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "plt.figure(figsize=(12,8))\n",
    "## Note, sample size remains consistent throughout the multiple runs\n",
    "plt.plot(samples_dict['0'], extract_macros_mean(random_dict), label='Random Sampling')\n",
    "plt.plot(samples_dict['0'], extract_macros_mean(active_dict), label='Active Learning')\n",
    "plt.plot(samples_dict['0'], extract_macros_mean(ours_original_dict), label='Our Method')\n",
    "plt.plot(samples_dict['0'], extract_macros_mean(ours_dict_closer), label='Our Method - modified')\n",
    "plt.xlabel(str('# of training samples'), fontsize=14, fontweight='bold')\n",
    "plt.ylabel(str('f1 (macro)'), fontsize=14, fontweight='bold')\n",
    "plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')\n",
    "plt.setp(plt.gca().get_legend().get_texts(), fontsize='14')\n",
    "plt.tight_layout()\n",
    "\n",
    "### To save the graph:\n",
    "# plt.savefig('Comparing_Annotation_Approaches.png')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (dror ch)",
   "language": "python",
   "name": "dror_ch"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
